A Statistical Approach to Thai Morphological Analyzer

Authors

  • Asanee Kawtrakul
  • Chalatip Thumkanon
Abstract

Three nontrivial problems in Thai morphological processing are word boundary ambiguity, tagging ambiguity, and implicit spelling errors. These problems cause considerable difficulty for the parser because they produce alternative or erroneous chains of words. This work attempts to provide a computational solution to these linguistic phenomena, called Word Filtering. The filtering process calculates the probabilities of all possible chains of tagged words using a Markov Model. The most likely sequence of tagged words is the one that maximizes the chain probability. However, that sequence may itself be an erroneous chain containing an implicit spelling error. Word Filtering therefore also includes a scanning process that detects and corrects these errors. Both the filtering and scanning processes use statistical information collected from a hand-tagged corpus. The experiment has shown that Word Filtering can eliminate most of the alternative word sequences. Moreover, this technique is fairly good at implicit error correction.
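The filtering step described in the abstract can be illustrated with a small sketch. The abstract only says that a Markov Model scores chains of tagged words, so the details here are assumptions: a first-order model with tag-to-tag transition probabilities and word-given-tag emission probabilities, a smoothing floor for unseen events, and invented toy numbers. The names `TRANSITION`, `EMISSION`, `FLOOR`, `chain_log_prob`, and `best_chain` are hypothetical and not taken from the paper. The two candidate chains are alternative segmentations of the same Thai string.

```python
import math

# Hypothetical toy parameters. A real system would estimate these
# from a hand-tagged corpus; the numbers below are invented.
TRANSITION = {  # P(tag_i | tag_{i-1}); "<s>" marks the start of a chain
    ("<s>", "NOUN"): 0.6,
    ("<s>", "VERB"): 0.4,
    ("NOUN", "ADJ"): 0.3,
    ("VERB", "NOUN"): 0.5,
}
EMISSION = {  # P(word | tag)
    ("ตา", "NOUN"): 0.02,
    ("กลม", "ADJ"): 0.01,
    ("ตาก", "VERB"): 0.005,
    ("ลม", "NOUN"): 0.01,
}
FLOOR = 1e-8  # assumed smoothing floor for unseen transitions or emissions


def chain_log_prob(chain):
    """Log-probability of one chain of (word, tag) pairs under the toy model."""
    logp = 0.0
    prev_tag = "<s>"
    for word, tag in chain:
        logp += math.log(TRANSITION.get((prev_tag, tag), FLOOR))
        logp += math.log(EMISSION.get((word, tag), FLOOR))
        prev_tag = tag
    return logp


def best_chain(candidates):
    """Return the candidate chain with the highest log-probability."""
    return max(candidates, key=chain_log_prob)


# Two candidate segmentations and taggings of the string "ตากลม".
candidates = [
    [("ตา", "NOUN"), ("กลม", "ADJ")],
    [("ตาก", "VERB"), ("ลม", "NOUN")],
]
best = best_chain(candidates)
```

Working in log space avoids numeric underflow when chains are long. Enumerating every chain, as done here, is only practical for short inputs; a dynamic-programming search such as Viterbi would be a natural replacement, though the abstract does not say which search the authors used.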

Similar resources

Morphological Analyzer for Gujarati using Paradigm based approach with Knowledge based and Statistical Methods

A morphological analyzer is a tool which performs syntactic analysis of a word and finds the root form of an inflected input word form. A morph analyzer serves as a pre-processing tool for many NLP applications. A significant amount of work has been done in this area for many Indian languages, but not much work has been reported for the Gujarati language. We present a morph analyzer for the Gujarati language. The Morp...

A Gradual Refinement Model for A Robust Thai Morphological Analyzer

This work attempts to provide a robust Thai morphological analyzer which can automatically assign the correct part-of-speech tag to the correct word with time and space efficiency. Instead of using a corpus based approach which requires a large amount of training data and validation data, a new simple hybrid technique which incorporates heuristic, syntactic and semantic knowledge is proposed. T...

The MIRACL Arabic-English Statistical Machine Translation

This paper describes the MIRACL statistical machine translation system and the improvements that were developed during the IWSLT 2010 evaluation campaign. We participated in the Arabic-to-English BTEC tasks using a phrase-based statistical machine translation approach. In this paper, we first discuss some challenges in translating from Arabic to English and we explore various techniques to impr...

Statistical Morphological Tagging and Parsing of Korean with an LTAG Grammar

This paper describes a lexicalized tree adjoining grammar (LTAG) based parsing system for Korean which combines corpus-based morphological analysis and tagging with a statistical parser. Part of the challenge of statistical parsing for Korean comes from the fact that Korean has free word order and a complex morphological system. The parser uses an LTAG grammar which is automatically extracted u...

A Sequence Labeling Approach to Morphological Analyzer for Tamil Language

Morphological analysis is the basic process for any Natural Language Processing task. Morphology is the study of internal structure of the word. Morphological analysis retrieves the grammatical features and properties of a morphologically inflected word. Capturing the agglutinative structure of Tamil words by an automatic system is a challenging job. Generally rule based approaches are used for...

Publication date: 1997